
[BACKPORT 2025.2][build] Improve error reporting and retry for archive downloads (#31364) #31553

Open
hari90 wants to merge 2 commits into yugabyte:2025.2 from hari90:backport-3ce79c6e6-2025.2

hari90 commented May 11, 2026

Summary

When the third-party archive checksum download returned an HTML error page instead of the expected .sha256 file, download_and_extract_archive.py only reported the file size, which made the failure hard to diagnose. There was also no retry, so a single transient failure (e.g. a 5xx from GitHub) would fail the build.

Example failure:

Checksum file size is too big: 55118 bytes

(failing job)

Changes

  • download_url now passes -f to curl so HTTP error responses no longer get written to disk as the requested artifact, and uses --retry / --retry-delay to retry transient failures (5xx, connection errors).
  • The "checksum file size is too big" error now includes the first 1024 bytes of the file so the underlying error (e.g. an HTML page) is visible in build logs.
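Here is a minimal sketch of the kind of curl invocation the first bullet describes. It assumes `download_url` shells out to curl through `subprocess`. The `build_curl_cmd` helper, the retry count and delay values, and the `-L` flag (for following GitHub's release-asset redirects) are illustrative assumptions and are not taken from the actual script.

```python
import subprocess
from typing import List

# Hypothetical values; the real script's retry settings may differ.
CURL_RETRIES = 3
CURL_RETRY_DELAY_SEC = 5


def build_curl_cmd(url: str, dest_path: str) -> List[str]:
    """Hypothetical helper: build a curl command that fails on HTTP errors and retries."""
    return [
        "curl",
        "-f",  # exit with status 22 on HTTP errors instead of saving the error page
        "-L",  # assumed: follow redirects such as GitHub release-asset redirects
        "--retry", str(CURL_RETRIES),  # let curl retry errors it considers transient
        "--retry-delay", str(CURL_RETRY_DELAY_SEC),  # fixed wait between curl's retries
        "-o", dest_path,
        url,
    ]


def download_url(url: str, dest_path: str) -> None:
    # check=True raises CalledProcessError if curl exits non-zero.
    subprocess.run(build_curl_cmd(url, dest_path), check=True)
```

Because of `-f`, an HTML error page never reaches `dest_path`, so the later checksum step does not see it.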

Co-authored-by: Claude <noreply@anthropic.com>

Original commit: 3ce79c6 / #31364, 9afef01 / #31427


CSI

…e downloads (yugabyte#31364)

## Summary

When the third-party archive checksum download returned an HTML error page instead of the expected `.sha256` file, `download_and_extract_archive.py` only reported the file size, which made the failure hard to diagnose. There was also no retry, so a single transient failure (e.g. a 5xx from GitHub) would fail the build.

Example failure:

Checksum file size is too big: 55118 bytes

([failing job](https://github.com/yugabyte/yugabyte-db/actions/runs/25144698357/job/73701866705?pr=31359))

## Changes

- `download_url` now passes `-f` to curl so HTTP error responses no longer get written to disk as the requested artifact, and uses `--retry` / `--retry-delay` to retry transient failures (5xx, connection errors).
- The "checksum file size is too big" error now includes the first 1024 bytes of the file so the underlying error (e.g. an HTML page) is visible in build logs.
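The second bullet can be illustrated with a hypothetical `read_checksum_file` helper. Only the 1024-byte preview and the "too big" message come from the description. The size threshold, the helper name, and the exact message layout are assumptions.

```python
import os

# Hypothetical threshold; the real limit in verify_sha256sum may differ.
MAX_CHECKSUM_FILE_SIZE = 4096
PREVIEW_BYTES = 1024


def read_checksum_file(path: str) -> str:
    """Read a .sha256 file, including a content preview in the error if it is too large."""
    size = os.path.getsize(path)
    if size > MAX_CHECKSUM_FILE_SIZE:
        # Show the start of the file (often an HTML error page) in the build log.
        with open(path, "rb") as f:
            preview = f.read(PREVIEW_BYTES).decode("utf-8", errors="replace")
        raise ValueError(
            f"Checksum file size is too big: {size} bytes. "
            f"First {PREVIEW_BYTES} bytes:\n{preview}"
        )
    with open(path) as f:
        return f.read().strip()
```

Decoding with `errors="replace"` keeps the error path from raising a second exception when the preview cuts a multi-byte character in half.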

---------

Co-authored-by: Claude <noreply@anthropic.com>

Original commit: 3ce79c6 / yugabyte#31364
hari90 requested review from es1024 and svarnau May 11, 2026 23:56

gemini-code-assist Bot left a comment

Code Review

This pull request implements retry logic for archive downloads by adding MAX_DOWNLOAD_ATTEMPTS and RETRY_DELAY_SEC constants and updating the curl command with retry flags. It also improves error reporting in verify_sha256sum by providing a preview of the file content when the checksum file size is unexpectedly large. I have no feedback to provide.


hari90 commented May 11, 2026

Trigger Jenkins


hari90 commented May 11, 2026

Jenkins build has been triggered. Results will be posted once it completes. CSI


JenkinsBot

…t the Python level on any curl failure (yugabyte#31427)

## Summary

Third-party archive downloads occasionally fail with curl exit status 22 on GitHub Actions because curl's `--retry` only retries timeouts and HTTP 408/429/500/502/503/504 responses, not transient 403s on GitHub's signed release-asset redirects. `--retry-all-errors` would cover this but requires curl >= 7.71.0, which is unavailable on AlmaLinux 8 / RHEL 8 (curl 7.61.1) and similar runners.
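The version constraint comes down to a tuple comparison. The following hypothetical helper parses `curl --version` output and checks it against 7.71.0, which makes it clear why 7.61.1 falls short. According to the description, the change does not gate on the curl version at all. This sketch only illustrates the threshold, and the helper names are assumptions.

```python
import re
from typing import Tuple

# --retry-all-errors was added in curl 7.71.0.
RETRY_ALL_ERRORS_MIN_VERSION = (7, 71, 0)


def parse_curl_version(version_output: str) -> Tuple[int, ...]:
    """Hypothetical helper: extract (major, minor, patch) from `curl --version` output."""
    match = re.match(r"curl (\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError(f"Unrecognized curl version output: {version_output!r}")
    return tuple(int(part) for part in match.groups())


def supports_retry_all_errors(version_output: str) -> bool:
    # Tuples compare element by element, so (7, 61, 1) < (7, 71, 0).
    return parse_curl_version(version_output) >= RETRY_ALL_ERRORS_MIN_VERSION
```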

Wrap the curl invocation in a Python retry loop instead, so any curl failure is retried regardless of curl version.
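Here is a minimal sketch of such a Python-level retry loop. The constant names `MAX_DOWNLOAD_ATTEMPTS` and `RETRY_DELAY_SEC` follow the code-review summary on this PR, but their values here are assumptions. The `run_curl_with_retries` helper and its injectable `run`/`sleep` parameters are hypothetical and were added for testability.

```python
import subprocess
import time
from typing import Callable, List

# Names follow the review summary; the values are assumptions.
MAX_DOWNLOAD_ATTEMPTS = 3
RETRY_DELAY_SEC = 5


def run_curl_with_retries(
    cmd: List[str],
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Run a curl command, retrying on ANY non-zero exit status (e.g. 22 from a 403)."""
    for attempt in range(1, MAX_DOWNLOAD_ATTEMPTS + 1):
        result = run(cmd)
        if result.returncode == 0:
            return
        if attempt < MAX_DOWNLOAD_ATTEMPTS:
            print(f"curl exited with status {result.returncode} "
                  f"(attempt {attempt}/{MAX_DOWNLOAD_ATTEMPTS}); "
                  f"retrying in {RETRY_DELAY_SEC}s")
            sleep(RETRY_DELAY_SEC)
    # All attempts failed: surface the last exit status to the caller.
    raise subprocess.CalledProcessError(result.returncode, cmd)
```

The loop checks only curl's exit status, so it works the same on curl 7.61.1 and on newer versions. Passing `run` and `sleep` in as parameters lets the loop be exercised without network access or real delays.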

Fixes yugabyte#31426.

## Test Plan
Jenkins: compile only

Original commit: 9afef01 / yugabyte#31427

hari90 commented May 12, 2026

Trigger Jenkins


hari90 commented May 12, 2026

Jenkins build has been triggered. Results will be posted once it completes. CSI


JenkinsBot


hari90 commented May 12, 2026

Jenkins build for commit f34f5b3d: Fail
CSI
Reason: CSI status: FAIL

Errors:

Checking for number of tests planned versus executed.

| Type | C++ Plan | Java Plan | Planned | Executed | Status |
|---|---:|---:|---:|---:|---|
| PR31553-arm-alma8-clang19-release #2 | 7597 | 3253 | 10850 | 10850 | Okay |
| PR31553-alma8-clang19-release #2 | 7597 | 3255 | 10852 | 10852 | Okay |
| PR31553-alma8-gcc12-fastdebug #2 | 8336 | 3255 | 11591 | 11591 | Okay |
| PR31553-mac14-clang-release #2 | 0 | 0 | 0 | 0 | Okay |
| PR31553-ubuntu22.04-clang19-debug #2 | 0 | 0 | 0 | 0 | Okay |
| PR31553-alma8-clang19-tsan #2 | 8321 | 3060 | 11381 | 11379 | FAILURE |
| PR31553-arm-mac14-clang-release #2 | 11 | 4 | 15 | 15 | Okay |
| PR31553-alma9-clang19-asan #2 | 8319 | 3150 | 11469 | 11469 | Okay |

🔨 DB Build/Test Job Summary

| Build | Total | Passed | Failed | Failed After Retries |
|---|---:|---:|---:|---:|
| PR31553-arm-alma8-clang19-release | 10852 | 10426 | 5 | 5 |
| PR31553-alma8-clang19-release | 10854 | 10426 | 8 | 8 |
| PR31553-alma8-gcc12-fastdebug | 11593 | 11133 | 7 | 7 |
| PR31553-mac14-clang-release | 2 | 2 | 0 | 0 |
| PR31553-ubuntu22.04-clang19-debug | 2 | 2 | 0 | 0 |
| PR31553-alma8-clang19-tsan | 11383 | 9740 | 5 | 5 |
| PR31553-arm-mac14-clang-release | 17 | 17 | 0 | 0 |
| PR31553-alma9-clang19-asan | 11471 | 10652 | 14 | 14 |

JenkinsBot
